Search Results: "Keith Packard"

16 June 2014

Keith Packard: Altos1.4

AltOS 1.4: TeleGPS support, features and bug fixes

Bdale and I are pleased to announce the release of AltOS version 1.4.

AltOS is the core of the software for all of the Altus Metrum products. It consists of firmware for our cc1111, STM32L151, LPC11U14 and ATtiny85 based electronics and Java-based ground station software.

This is a major release of AltOS, including support for our new TeleGPS board and a host of new features and bug fixes.

AltOS Firmware: TeleGPS added, new features and fixes

Our new tracker, TeleGPS, works quite differently than a flight computer:

Starts tracking and logging at power-on
Disables RF and logging only when connected to USB
Doesn't log position when it isn't moving for a long time

TeleGPS transmits our digital telemetry protocol, APRS and radio direction finding beacons.

For TeleMega, we've made the firing time for the additional pyro channels (A-D) configurable, in case the default (50ms) isn't long enough.

AltOS Beeping Changes

The three-beep startup tones have been replaced with a report of the current battery voltage. This is nice on all of the boards, but particularly useful with EasyMini, which doesn't have the benefit of telemetry reporting its state.

We also changed the other state tones to Farnsworth spacing. This makes them all faster, and easier to distinguish from the numeric reports of voltage and altitude.

Finally, we've added the ability to change the frequency of the beeper tones. This is nice when you have two Altus Metrum flight computers in the same ebay and want to be able to tell the beeps apart.

AltOS Bug Fixes

Fixed a bug which prevented you from using TeleMega's extra pyro channel "Flight State After" configuration value.

AltOS 1.3.2 on TeleMetrum v2.0 and TeleMega would reset the flight number to 2 after erasing flights; that's been fixed.

AltosUI: New Maps, igniter tab and a few fixes

With TeleGPS tracks now potentially ranging over a much wider area than a typical rocket flight, the Maps interface has been updated to include zooming and multiple map styles. It also now uses less memory, which should make it work on a wider range of systems.

For TeleMega, we've added an Igniter tab to the flight monitor interface so you can check voltages on the extra pyro channels before pushing the button.

We're hoping that the new Maps interface will load and run on machines with limited memory for Java applications; please let us know if this changes anything for you.

TeleGPS: All new application just for TeleGPS

While TeleGPS shares the same telemetry and data logging capabilities as all of the Altus Metrum flight computers, its use as a tracker is expected to be both broader and simpler than the rocketry-specific systems. We've built a custom TeleGPS application that incorporates the mapping and data visualization aspects of AltosUI, but eliminates all of the rocketry-specific flight state tracking.

27 April 2014

Keith Packard: Glamor performance

Current Glamor Performance

I finally managed to get a Gigabyte Brix set up running Debian so that I could do some more reasonable performance characterization of Glamor in its current state. I wanted to use this particular machine because it has enough cooling to keep from thermally throttling the CPU/GPU package.

This is running my glamor-server branch of the X server, which completes the core operation rework and then has some core X server performance improvements as well for filled and outlined arcs.

Changes in X11perf

First off, I did some analysis of what x11perf was doing and found that it wasn't quite measuring what we thought. I'm not interested in competing on x11perf numbers absolutely, I'm only interested in doing relative measurements of useful operations, so fixing the tool to measure what I want seems reasonable to me.

When x11perf was first written, it drew 100x100 rectangles tight against one another without any gap. And, it filled the window with them, drawing a 6x6 grid of 100x100 rectangles in a 600x600 window. To better exercise the rectangle code and check edge conditions, we added a one pixel gap between the rectangles. However, we didn't reduce the number of rectangles drawn, so we ended up drawing 11 of the 36 rectangles on top of the first set of 25. Simple region computations would allow the X server to draw only 25 most of the time, skipping the redundant rectangles.

The vertical and horizontal line tests were added a while after the first set of tests, and were done without regard to how an X server might optimize for them. x11perf draws these lines packed tightly together, creating a single square of pixels for the result. EXA, UXA and SNA all take vertical and horizontal lines and convert them to rectangles, then take the rectangles and clip them against the window clip list by computing a region from them and intersecting that with the GC composite clip. It's a completely reasonable plan; however, when you take what x11perf was drawing and run it through this code, you end up with a single solid rectangle. Which is surprisingly fast, compared with drawing individual lines.

I fixed the overlapping rectangle case by reducing the number of boxes drawn from 36 to 25, and I fixed the vertical and horizontal line case by spacing the lines a pixel apart.

I've pushed out these changes to my x11perf repository on freedesktop.org.

What's Fast

Things that match GL's capabilities are fast, things which don't are slow. No surprises there. What's interesting is precisely what matches GL.

Patterns For Free

Because GL makes it easy to program fill patterns into the GPU, there are essentially no performance differences between solid and patterned operations.

GL Lines

Glamor uses GL lines, which can be programmed to match X semantics, to quite good effect. The only trick required was to deal with cap styles. GL never draws the final pixel in a line, while X does unless the cap style is CapNotLast. The solution was to draw an extra line segment containing a single pixel at the end of every joined set of lines for this case.

The other implicit requirement is that all zero width lines look the same. Right now, I've solved that for fill styles and raster ops as they're all drawn with the same GL operations. However, for plane masks, we're currently falling back to software, which may draw something different. Fixing that isn't impossible, it's just tedious.

Text

Pushing all of the work of drawing core text into glamor wasn't terribly difficult, and the results are pretty spectacular.

What's Slow

We've still got room for improvement in Glamor, but there aren't any obvious show-stoppers to getting great performance for reasonable X applications anymore.

Wide Lines and Arcs

One of the speed-ups I've made in my glamor branch is to merge all of the drawing of multiple filled and zero-width arcs into a single giant GL request. That turned out to both improve performance and save a bit of code. Right now, drawing wide lines and wide arcs doesn't do this, and so we suffer from submitting many smaller requests to GL. It's hard to get excited about speeding any of this up as all of the wide primitives are essentially unused these days.

Filled Polygons

Because X only lets applications draw a single polygon in each request, Glamor can't really gain any efficiency from batching work unless we start looking ahead in the X protocol stream to see if the next request is another polygon. Alternatively, we could leave the span operation pending to see if more spans were coming before the X server went idle. Neither of these is all that exciting though; X polygons just aren't that useful.

Render Operations

These are still not structured to best fit modern GL; some work here would help a bunch. We've got a GSoC student ready to go at this though, so I expect we'll have much better numbers in a few months.

Window Operations

You wouldn't think that moving and resizing windows would be so limited by drawing performance, but x11perf tests these with tiny little windows, and each operation draws or copies only a couple of little rectangles, which makes GL quite expensive. Working on speeding up GL for small numbers of operations would help a bunch here.

Unexpected Results

Solid rectangles are actually running slower than patterned rectangles, and I really have no idea why. The CPU is close to idle during the 500x500 solid rectangle test (as you'd expect, given the workload), the vertex and fragment shaders look correct out of the compiler, and yet solid rectangles run at only 0.80 of the performance of the patterned rectangles.

GL semantics for copying data essentially preclude overlapping blts of any form. There's the NV_texture_barrier extension which at least lets us do blts within the same object, but even that doesn't define how overlapping blts work. So, we have to create a temporary copy for this operation to make it work right. I was worried that this would slow things down, but the Iris Pro 3D engine is enough faster than the 2D engine that even with the extra copy, large scrolls and copies within the same object are actually faster.

Results

Here's a giant image showing the ratio of Glamor to both UXA and SNA running on the same machine, with all of the same software; the only change between runs was to switch the configured acceleration architecture.
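The rectangle counts in the x11perf discussion above (36 rectangles without a gap, only 25 once a one pixel gap is added) follow from simple arithmetic. Here is a minimal sketch of that calculation, assuming the rectangles sit on a regular grid starting at the window origin; the function names are made up for illustration and are not taken from x11perf.

```c
#include <assert.h>

/* Number of rectangles of `size` pixels that fit along one axis of a
 * `window`-pixel window when neighbours are separated by `gap` pixels.
 * n rectangles need n*size + (n-1)*gap pixels, hence the formula below.
 * (Illustrative helper, not part of x11perf.) */
int rects_per_axis(int window, int size, int gap)
{
    assert(window > 0 && size > 0 && gap >= 0);
    return (window + gap) / (size + gap);
}

/* Number of non-overlapping rectangles in a square window. */
int rects_in_window(int window, int size, int gap)
{
    int n = rects_per_axis(window, size, gap);
    return n * n;
}
```

With a 600 pixel window and 100 pixel rectangles, a gap of 0 gives 6 per axis (36 total), while a gap of 1 gives 5 per axis (25 total), which leaves 11 of the original 36 with nowhere to go but on top of the others.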

6 April 2014

Keith Packard: Java-Sound-on-Linux

Java Sound on Linux

I'm often in the position of having my favorite Java program (AltosUI) unable to make any sounds. Here's a history of the various adventures I've had.

Java and PulseAudio ALSA support

When we started playing with Java a few years ago, we discovered that if PulseAudio were enabled, Java wouldn't make any sound. Presumably, that was because the ALSA emulation layer offered by PulseAudio wasn't capable of supporting Java.

The fix for that was to make sure pulseaudio would never run. That's harder than it seems; pulseaudio is like the living dead, rising from the grave every time you kill it. As it's nearly impossible to install any desktop applications without gaining a bogus dependency on pulseaudio, the solution that works best is to make sure dpkg never manages to actually install the program with dpkg-divert:

# dpkg-divert --rename /usr/bin/pulseaudio

With this in place, Java was a happy camper for a long time.

Java and PulseAudio Native support

More recently, Java has apparently gained some native PulseAudio support in some fashion. Of course, I couldn't actually get it to work, even after running the PulseAudio daemon, but some kind Debian developer decided that sound should be broken by default for all Java applications and selected the PulseAudio back-end in the Java audio configuration file.

Fixing that involved learning about said Java audio configuration file and then applying a patch to revert the Debian packaging damage.

$ cat /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/sound.properties
...
#javax.sound.sampled.Clip=org.classpath.icedtea.pulseaudio.PulseAudioMixerProvider
#javax.sound.sampled.Port=org.classpath.icedtea.pulseaudio.PulseAudioMixerProvider
#javax.sound.sampled.SourceDataLine=org.classpath.icedtea.pulseaudio.PulseAudioMixerProvider
#javax.sound.sampled.TargetDataLine=org.classpath.icedtea.pulseaudio.PulseAudioMixerProvider
javax.sound.sampled.Clip=com.sun.media.sound.DirectAudioDeviceProvider
javax.sound.sampled.Port=com.sun.media.sound.PortMixerProvider
javax.sound.sampled.SourceDataLine=com.sun.media.sound.DirectAudioDeviceProvider
javax.sound.sampled.TargetDataLine=com.sun.media.sound.DirectAudioDeviceProvider

You can see the PulseAudio mistakes at the top of that listing, with the corrected native interface settings at the bottom.

Java and single-open ALSA drivers

It used to be that ALSA drivers could support multiple applications having the device open at the same time. Those with hardware mixing would use that to merge the streams together; those without hardware mixing might do that in the kernel itself. While the latter is probably not a great plan, it did make ALSA a lot more friendly to users.

My new laptop is not friendly, and returns EBUSY when you try to open the PCM device more than once.

After downloading the JDK and ALSA library sources, I figured out that Java was trying to open the PCM device multiple times when using the standard Java sound API in the simplest possible way. I thought I was going to have to fix Java, when I figured out that ALSA provides user-space mixing with the dmix plugin. I enabled that on my machine and now all was well.

$ cat /etc/asound.conf
pcm.!default {
    type plug
    slave.pcm "dmixer"
}
pcm.dmixer {
    type dmix
    ipc_key 1024
    slave {
        pcm "hw:1,0"
        period_time 0
        period_size 1024
        buffer_size 4096
        rate 44100
    }
    bindings {
        0 0
        1 1
    }
}
ctl.dmixer {
    type hw
    card 1
}
ctl.!default {
    type hw
    card 1
}

As you can see, my sound card is not number 0, it's number 1, so if your card is a different number, you'll have to adapt as necessary.

22 March 2014

Keith Packard: glamor-core

Core Rendering with Glamor

I've hacked up the intel driver to bypass all of the UXA paths when Glamor is enabled, so I'm currently running an X server that uses only Glamor for all rendering. There are still too many fallbacks, and performance for some operations is not what I'd like, but it's entirely usable. It supports DRI3, so I even have GL applications running.

Core Rendering Status

I've continued to focus on getting the core X protocol rendering operations complete and correct; those remain a critical part of many X applications and are a poor match for GL. At this point, I've got accelerated versions of the basic spans functions, filled rectangles, text and copies.

GL and Scrolling

OpenGL has been on a many-year vendetta against one of the most common 2D accelerated operations: copying data within the same object, even when that operation overlaps itself. This used to be the most performance-critical operation in X; it was used for scrolling your terminal windows and when moving windows around on the screen.

Reviewing the OpenGL 3.x spec, Eric and I both read the glCopyPixels specification as clearly requiring correct semantics for overlapping copy operations; it says that the operation must be equivalent to reading the pixels and then writing the pixels. My CopyArea acceleration thus uses this path for the self-copy case. However, the ARB decided that having a well defined blt operation was too nice to the users, so the current 4.4 specification adds explicit language to assert that this is not well defined anymore (even in the face of the existing language which is pretty darn unambiguous).

I suspect we'll end up creating an extension that offers what we need here; applications are unlikely to stop scrolling stuff around, and GPUs (at least Intel) will continue to do what we want. This is the kind of thing that makes GL maddening for 2D graphics: the GPU does what we want, and the GL doesn't let us get at it.

For implementations not capable of supporting the required semantic, someone will presumably need to write code that creates a temporary copy of the data.

PBOs for fallbacks

For operations which Glamor can't manage, we need to fall back to using a software solution. Direct-to-hardware acceleration architectures do this by simply mapping the underlying GPU object to the CPU. GL doesn't provide this access, and it's probably a good thing as such access needs to be carefully synchronized with GPU access, and attempting to access tiled GPU objects with the CPU requires either piles of CPU code to de-tile accesses (à la wfb), or special hardware detilers (like the Intel GTT).

However, GL does provide a fairly nice abstraction called pixel buffer objects (PBOs) which work to speed access to GPU data from the CPU.

The fallback code allocates a PBO for each relevant X drawable, asks GL to copy pixels in, and then calls fb, with the drawable now referencing the temporary buffer. On the way back out, any potentially modified pixels are copied back through GL and the PBOs are freed.

This turns out to be dramatically faster than malloc'ing temporary buffers as it allows the GL to allocate memory that it likes, and for it to manage the data upload and buffer destruction asynchronously.

Because X pixmaps can contain many X windows (the root pixmap being the most obvious example), they are often significantly larger than the actual rendering target area. As an optimization, the code only copies data from the relevant area of the pixmap, saving considerable time as a result. There's even an interface which further restricts that to a subset of the target drawable which the Composite function uses.

Using Scissoring for Clipping

The GL scissor operation provides a single clipping rectangle. X provides a list of rectangles to clip to. There are two obvious ways to perform clipping here: either perform all clipping in software, or hand each X clipping rectangle in turn to GL and re-execute the entire rendering operation for each rectangle.

You'd think that the former plan would be the obvious choice; clearly re-executing the entire rendering operation potentially many times is going to take a lot of time in the GPU.

However, the reality is that most X drawing occurs under a single clipping rectangle. Accelerating this common case by using the hardware clipper provides enough benefit that we definitely want to use it when it works. We could duplicate all of the rendering paths and perform CPU-based clipping when the number of rectangles was above some threshold, but the additional code complexity isn't obviously worth the effort, given how infrequently it will be used. So I haven't bothered. Most operations look like this:

Allocate VBO space for data
Fill VBO with X primitives
loop over clip rects {
    glScissor()
    glDrawArrays()
}

This obviously out-sources as much of the problem as possible to the GL library, reducing the CPU time spent in glamor to a minimum.

A Peek at Some Code

With all of these changes in place, drawing something like a list of rectangles becomes a fairly simple piece of code.

First, make sure the program we want to use is available and can be used with our GC configuration:

prog = glamor_use_program_fill(pixmap, gc,
                               &glamor_priv->poly_fill_rect_program,
                               &glamor_facet_polyfillrect);
if (!prog)
    goto bail_ctx;

Next, allocate the VBO space and copy all of the X data into it. Note that the data transfer is simply memcpy here; that's because we break the X objects apart in the vertex shader using instancing, avoiding the CPU overhead of computing four corner coordinates.

/* Set up the vertex buffers for the points */
v = glamor_get_vbo_space(drawable->pScreen, nrect * (4 * sizeof (GLshort)), &vbo_offset);
glEnableVertexAttribArray(GLAMOR_VERTEX_POS);
glVertexAttribDivisor(GLAMOR_VERTEX_POS, 1);
glVertexAttribPointer(GLAMOR_VERTEX_POS, 4, GL_SHORT, GL_FALSE,
                      4 * sizeof (GLshort), vbo_offset);
memcpy(v, prect, nrect * sizeof (xRectangle));
glamor_put_vbo_space(screen);
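To see why a plain memcpy is enough, it helps to look at the data layout: an X rectangle is four 16-bit values (x, y, width, height), which is exactly one four-component GL_SHORT attribute per instance. The following is a rough CPU-side sketch under that assumption. The `sketch_rect` type is a stand-in for the protocol's xRectangle, and `rect_corner` emulates one plausible way a vertex shader could expand an instance into the four corners of a triangle strip from the vertex index; neither name comes from the Glamor sources, and the real shader may well differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the X protocol rectangle: four 16-bit fields. */
typedef struct {
    int16_t  x, y;
    uint16_t width, height;
} sketch_rect;

/* Copy rectangles straight into vertex storage, four shorts per instance.
 * Returns the number of shorts written. */
size_t pack_rects(int16_t *dst, const sketch_rect *src, size_t nrect)
{
    _Static_assert(sizeof(sketch_rect) == 4 * sizeof(int16_t),
                   "a rectangle must occupy exactly four shorts");
    memcpy(dst, src, nrect * sizeof(sketch_rect));
    return nrect * 4;
}

/* Emulate a per-vertex expansion (hypothetical): vertex ids 0..3 select the
 * corners in triangle-strip order by adding width and/or height. */
void rect_corner(const int16_t inst[4], int vertex_id, int *cx, int *cy)
{
    assert(vertex_id >= 0 && vertex_id < 4);
    *cx = inst[0] + (vertex_id & 1) * inst[2];
    *cy = inst[1] + ((vertex_id >> 1) & 1) * inst[3];
}
```

For a rectangle at (1, 2) with size 3x4, vertex ids 0 through 3 produce (1, 2), (4, 2), (1, 6) and (4, 6). Note that reading the width and height as signed shorts only works for values up to 32767.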

Finally, loop over the pixmap tile fragments, and then over the clip list, selecting the drawing target and painting the rectangles:

glEnable(GL_SCISSOR_TEST);
glamor_pixmap_loop(pixmap_priv, box_x, box_y) {
    int nbox = RegionNumRects(gc->pCompositeClip);
    BoxPtr box = RegionRects(gc->pCompositeClip);
    glamor_set_destination_drawable(drawable, box_x, box_y, TRUE, FALSE, prog->matrix_uniform, &off_x, &off_y);
    while (nbox--) {
        glScissor(box->x1 + off_x,
                  box->y1 + off_y,
                  box->x2 - box->x1,
                  box->y2 - box->y1);
        box++;
        glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, nrect);
    }
}

GL texture size limits

X pixmaps use 16 bit dimensions for width and height, allowing them to be up to 65535 x 65535 pixels. Because the X coordinate space is signed, only a quarter of this space is actually useful, which makes the useful size of X pixmaps only 32767 x 32767. This is still larger than most GL implementations offer as a maximum texture size though, and while it would be nice to just say "we don't allow pixmaps larger than GL textures", the reality is that many applications expect to be able to allocate such pixmaps today, largely to hold the ever increasing size of digital photographs.

Glamor has always supported large X pixmaps; it does this by splitting them up into tiles, each of which is no larger than the largest texture supported by the driver. What I've added to Glamor is some simple macros that walk over the array of tiles, making it easy for the rendering code to support large pixmaps without needing any special case code.

Glamor also had some simple testing support: you can compile the code to ignore the system-provided maximum texture size and supply your own value. This code had gone stale, and couldn't work as there were parts of the code for which tiling support just doesn't make sense, like the glyph cache, or the X scanout buffer. I fixed things so that you could leave those special cases as solitary large tiles while breaking up all other pixmaps into tiles no larger than 32 pixels square.

I hope to remove the single-tile case and leave the code supporting only the multiple-tile case; we have to have the latter code, and so having the single-tile code around simply increases our code size for no obvious benefit.

Getting accelerated copies between tiled pixmaps added a new coordinate system to the mix and took a few hours of fussing until it was working.
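The tiling arithmetic described above is easy to sketch. The helpers below are a hypothetical illustration, not Glamor's macros: they assume a pixmap is split along each axis into tiles of at most `max_tile` pixels, with the last tile taking whatever is left over.

```c
#include <assert.h>

/* Number of tiles needed to cover `size` pixels along one axis when no
 * tile may exceed `max_tile` pixels (rounds up). Illustrative only. */
int tiles_along(int size, int max_tile)
{
    assert(size > 0 && max_tile > 0);
    return (size + max_tile - 1) / max_tile;
}

/* Extent in pixels of tile number `index` along that axis; every tile is
 * max_tile wide except possibly the last one. */
int tile_extent(int size, int max_tile, int index)
{
    int start = index * max_tile;
    int remaining;

    assert(index >= 0 && start < size);
    remaining = size - start;
    return remaining < max_tile ? remaining : max_tile;
}
```

For example, assuming a driver limit of 8192 pixels, a 32767 pixel wide pixmap needs four tiles across, the last of them 8191 pixels wide. With the 32 pixel test limit mentioned above, a 100 pixel wide pixmap also needs four tiles, the last one 4 pixels wide.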
Rebasing Many (many) Times

I'm sure most of us remember the days before git; changes were often monolithic, and the notion of changing how the changes were made for the sake of clarity never occurred to anyone. It used to be that the final code was the only interesting artifact; how you got there didn't matter to anyone. Things are different today; I probably spend a third of my development time communicating how the code should change with other developers by changing the sequence of patches that are to be applied.

In the case of Glamor, I've now got a set of 28 patches. The first few are fixes outside of the glamor tree that make the rest of the server work better. Then there are a few general glamor infrastructure additions. After that, each core operation is replaced, one at a time. Finally, a bit of stale code is removed. By sequencing things in a logical fashion, I hope to make review of the code easier, which should mean that people will spend less time trying to figure out what I did and be able to spend more time figuring out if what I did is reasonable and correct.

Supporting Older Versions of GL

All of the new code uses vertex instancing to move coordinate computation from the CPU to the GPU. I'm also pulling textures apart using integer operations. Right now, we should correctly fall back to software for older hardware, but it would probably be nicer to just fall back to simpler GL instead. Unless everyone decides to go buy hardware with new enough GL driver support, someone is going to need to write simplified code paths for glamor.

If you've got such hardware, and are interested in making it work well, please take this as an opportunity to help yourself and others.

Near-term Glamor Goals

I'm pretty sure we'll have the code in good enough shape to merge before the window closes for X server 1.16. Eric is in charge of the glamor tree, so it's up to him when stuff is pulled in. He and Markus Wick have also been generating code and reviewing stuff, but we could always use additional testing and review to make the code as good as possible before the merge window closes.

Markus has expressed an interest in working on Glamor as a part of the X.org summer of code this year; there's clearly plenty of work to do here. Eric and I haven't touched the render acceleration stuff at all, and that code could definitely use some updating to use more modern GL features.

If that works as well as the core rendering code changes, then we can look forward to a Glamor which offers GPU-limited performance for classic X applications, without requiring GPU-specific drivers for every generation of every chip.

7 March 2014

Keith Packard: glamor-hacking

Brief Glamor Hacks

Eric Anholt started writing Glamor a few years ago. The goal was to provide credible 2D acceleration based solely on the OpenGL API, in particular, to implement the X drawing primitives, both core and Render extension, without any GPU-specific code. When he started, the thinking was that fixed-function devices were still relevant, so that original code didn't insist upon modern OpenGL features like GLSL shaders. That made the code less efficient and hard to write.

Glamor used to be a side-project within the X world; seen as something that really wasn't very useful; something that any credible 2D driver would replace with custom highly-optimized GPU-specific code. Eric and I both hoped that Glamor would turn into something credible and that we'd be able to eliminate all of the horror-show GPU-specific code in every driver for drawing X text, rectangles and composited images. That hadn't happened though, until now.

Fast forward to the last six months. Eric has spent a bunch of time cleaning up Glamor internals, and in fact he's had it merged into the core X server for version 1.16 which will be coming up this July. Within the Glamor code base, he's been cleaning some internal structures up and making life more tolerable for Glamor developers.

Using libepoxy

A big part of the cleanup was transitioning all of the extension function calls to use his other new project, libepoxy, which provides a sane, consistent and performant API to OpenGL extensions for Linux, Mac OS and Windows. That library is awesome, and you should use it for everything you do with OpenGL because not using it is like pounding nails into your head. Or eating non-tasty pie.

Using VBOs in Glamor

One thing he recently cleaned up was how to deal with VBOs during X operations. VBOs are absolutely essential to modern OpenGL applications; they're really the only way to efficiently pass vertex data from application to the GPU. And, every other mechanism is deprecated by the ARB as not a part of the blessed "core context".

Glamor provides a simple way of getting some VBO space, dumping data into it, and then using it through two wrapping calls which you use along with glVertexAttribPointer as follows:

pointer = glamor_get_vbo_space(screen, size, &offset);
glVertexAttribPointer(attribute_location, count, type,
              GL_FALSE, stride, offset);
memcpy(pointer, data, size);
glamor_put_vbo_space(screen);

glamor_get_vbo_space allocates the specified amount of VBO space and returns a pointer to that along with an "offset", which is suitable to pass to glVertexAttribPointer. You dump your data into the returned pointer, call glamor_put_vbo_space and you're all done.

Actually Drawing Stuff

At the same time, Eric has been optimizing some of the existing rendering code. But, all of it is still frankly terrible. Our dream of credible 2D graphics through OpenGL just wasn't being realized at all.

On Monday, I decided that I should go play in Glamor for a few days, both to hack up some simple rendering paths and to familiarize myself with the insides of Glamor as I'm getting asked to review piles of patches for it, and not understanding a code base is a great way to help introduce serious bugs during review.

I started with the core text operations. Not because they're particularly relevant these days, as most applications draw text with the Render extension to provide better looking results, but instead because they're often one of the hardest things to do efficiently with a heavy weight GPU interface, and OpenGL can be amazingly heavy weight if you let it.

Eric spent a bunch of time optimizing the existing text code to try and make it faster, but at the bottom, it actually draws each lit pixel as a tiny GL_POINT object by sending a separate x/y vertex value to the GPU (using the above VBO interface). This code walks the array of bits in the font, checking each one to see if it is lit, then checking if the lit pixel is within the clip region and only then adding the coordinates of the lit pixel to the VBO. The amazing thing is that even with all of this CPU and GPU work, the venerable 6x13 font is drawn at an astonishing 3.2 million glyphs per second. Of course, pure software draws text at 9.3 million glyphs per second.

I suspected that a more efficient implementation might be able to draw text a bit faster, so I decided to just start from scratch with a new GL-based core X text drawing function. The plan was pretty simple:

Dump all glyphs in the font into a texture. Store them in 1bpp format to minimize memory consumption.
Place raw (integer) glyph coordinates into the VBO. Place four coordinates for each and draw a GL_QUAD for each glyph.
Transform the glyph coordinates into the usual GL range (-1..1) in the vertex shader.
Fetch a suitable byte from the glyph texture, extract a single bit and then either draw a solid color or discard the fragment.
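The last two steps of that plan are small pieces of arithmetic that are normally done in the shaders. The following is a CPU-side sketch of the same calculations, written in C for illustration only. The function names are hypothetical, the sketch ignores the Y-axis flip between X and GL, and the bit order (least significant bit is the leftmost pixel, rows padded to whole bytes) is an assumption that may not match the actual glyph texture layout.

```c
#include <assert.h>
#include <stdint.h>

/* Map an integer pixel coordinate in [0, size] onto the GL range [-1, 1],
 * as a vertex shader would for raw integer glyph coordinates. */
float pixel_to_ndc(int coord, int size)
{
    assert(size > 0);
    return 2.0f * (float)coord / (float)size - 1.0f;
}

/* Look up one pixel of a 1bpp glyph bitmap: fetch the byte holding the
 * pixel, then extract a single bit. Returns 1 for a lit pixel (draw the
 * solid color) and 0 otherwise (discard the fragment).
 * Assumes LSB-first bit order and `stride_bytes` bytes per row. */
int glyph_bit(const uint8_t *bits, int stride_bytes, int x, int y)
{
    uint8_t byte = bits[y * stride_bytes + (x >> 3)];
    return (byte >> (x & 7)) & 1;
}
```

For a 600 pixel wide drawable, pixel 0 maps to -1, pixel 300 to 0 and pixel 600 to 1. In a fragment shader the equivalent of `glyph_bit` returning 0 would be a `discard`.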

This makes the X server code surprisingly simple; it computes integer coordinates for the glyph destination and glyph image source and writes those to the VBO. When all of the glyphs are in the VBO, it just calls glDrawArrays(GL_QUADS, 0, 4 * count). The results were encouraging:

1: fb-text.perf
2: glamor-text.perf
3: keith-text.perf
       1                 2                           3                 Operation
------------   -------------------------   -------------------------   -------------------------
   9300000.0      3160000.0 (     0.340)     18000000.0 (     1.935)   Char in 80-char line (6x13) 
   8700000.0      2850000.0 (     0.328)     16500000.0 (     1.897)   Char in 70-char line (8x13) 
   6560000.0      2380000.0 (     0.363)     11900000.0 (     1.814)   Char in 60-char line (9x15) 
   2150000.0       700000.0 (     0.326)      7710000.0 (     3.586)   Char16 in 40-char line (k14) 
    894000.0       283000.0 (     0.317)      4500000.0 (     5.034)   Char16 in 23-char line (k24) 
   9170000.0      4400000.0 (     0.480)     17300000.0 (     1.887)   Char in 80-char line (TR 10) 
   3080000.0      1090000.0 (     0.354)      7810000.0 (     2.536)   Char in 30-char line (TR 24) 
   6690000.0      2640000.0 (     0.395)      5180000.0 (     0.774)   Char in 20/40/20 line (6x13, TR 10) 
   1160000.0       351000.0 (     0.303)      2080000.0 (     1.793)   Char16 in 7/14/7 line (k14, k24) 
   8310000.0      2880000.0 (     0.347)     15600000.0 (     1.877)   Char in 80-char image line (6x13) 
   7510000.0      2550000.0 (     0.340)     12900000.0 (     1.718)   Char in 70-char image line (8x13) 
   5650000.0      2090000.0 (     0.370)     11400000.0 (     2.018)   Char in 60-char image line (9x15) 
   2000000.0       670000.0 (     0.335)      7780000.0 (     3.890)   Char16 in 40-char image line (k14) 
    823000.0       270000.0 (     0.328)      4470000.0 (     5.431)   Char16 in 23-char image line (k24) 
   8500000.0      3710000.0 (     0.436)      8250000.0 (     0.971)   Char in 80-char image line (TR 10) 
   2620000.0       983000.0 (     0.375)      3650000.0 (     1.393)   Char in 30-char image line (TR 24)

This is our old friend x11perfcomp, but slightly adjusted for a modern reality where you really do end up drawing billions of objects (hence the wider columns). This table lists the performance for drawing a range of different fonts in both poly text and image text variants. The first column is for Xephyr using software (fb) rendering, the second is for the existing Glamor GL_POINT based code and the third is the latest GL_QUAD based code. As you can see, drawing points for every lit pixel in a glyph is surprisingly fast, but only about 1/3 the speed of software for essentially any size glyph. By minimizing the use of the CPU and pushing piles of work into the GPU, we manage to increase the speed of most of the operations, with larger glyphs improving significantly more than smaller glyphs. Now, you ask how much code this involved. And, I can truthfully say that it was a very small amount to write:

 Makefile.am             2 
 glamor.c                5 
 glamor_core.c           8 
 glamor_font.c         181 ++++++++++++++++++++
 glamor_font.h          50 +++++
 glamor_priv.h          26 ++
 glamor_text.c         472 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 glamor_transform.c      2 
 8 files changed, 741 insertions(+), 5 deletions(-)

Let's Start At The Very Beginning The results of optimizing text encouraged me to start at the top of x11perf and see what progress I could make. In particular, looking at the current Glamor code, I noticed that it did all of the vertex transformation with the CPU. That makes no sense at all for any GPU built in the last decade; they've got massive numbers of transistors dedicated to performing precisely this kind of operation. So, I decided to see what I could do with PolyPoint. PolyPoint is absolutely brutal on any GPU; you have to pass it two coordinates for each pixel, and so the very best you can do is send it 32 bits, or precisely the same amount of data needed to actually draw a pixel on the frame buffer. With this in mind, one expects that about the best you can do compared with software is tie. Of course, the CPU version is actually computing an address and clipping, but those are all easily buried in the cost of actually storing a pixel. In any case, the results of this little exercise are pretty close to a tie: the CPU draws 190,000,000 dots per second and the GPU draws 189,000,000 dots per second. Looking at the vertex and fragment shaders generated by the compiler, it's clear that there's room for improvement. The fragment shader is simply pulling the constant pixel color from a uniform and assigning it to the fragment color in this, the simplest of all possible shaders:

uniform vec4 color;
void main()
{
       gl_FragColor = color;
}

This generates five instructions:

Native code for point fragment shader 7 (SIMD8 dispatch):
   START B0
   FB write target 0
0x00000000: mov(8)          g113<1>F        g2<0,1,0>F                        align1 WE_normal 1Q  ;
0x00000010: mov(8)          g114<1>F        g2.1<0,1,0>F                      align1 WE_normal 1Q  ;
0x00000020: mov(8)          g115<1>F        g2.2<0,1,0>F                      align1 WE_normal 1Q  ;
0x00000030: mov(8)          g116<1>F        g2.3<0,1,0>F                      align1 WE_normal 1Q  ;
0x00000040: sendc(8)        null            g113<8,8,1>F
                render ( RT write, 0, 4, 12) mlen 4 rlen 0        align1 WE_normal 1Q EOT  ;
   END B0

As this pattern is actually pretty common, it turns out there's a single instruction that can replace all four of the moves. That should actually make a significant difference in the run time of this shader, and this shader runs once for every single pixel. The vertex shader has some similar optimization opportunities, but it only runs once for every 8 pixels; with the SIMD format flipped around, the vertex shader can compute 8 vertices in parallel, so it ends up executing 8 times less often. It's got some redundant moves, which could be optimized by improving the copy propagation analysis code in the compiler. Of course, improving the compiler to make these cases run faster will probably make a lot of other applications run faster too, so it's probably worth doing at some point. Again, the amount of code necessary to add this path was tiny:

 Makefile.am             1 
 glamor.c                2 
 glamor_polyops.c      116 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 glamor_priv.h           8 +++
 glamor_transform.c    118 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 glamor_transform.h     51 ++++++++++++++++++++++
 6 files changed, 292 insertions(+), 4 deletions(-)

Discussion of Results These two cases, text and points, are probably the hardest operations to accelerate with a GPU and yet a small amount of OpenGL code was able to meet or beat software easily. The advantage of this work over traditional GPU 2D acceleration implementations should be pretty clear: this same code should work well on any GPU which offers a reasonable OpenGL implementation. That means everyone shares the benefits of this code, and everyone can contribute to making 2D operations faster. All of these measurements were actually done using Xephyr, which offers a testing environment unlike any I've ever had: build and test hardware acceleration code within a nested X server, debugging it in a windowed environment on a single machine. Here's how I'm running it:

$ ./Xephyr  -glamor :1 -schedMax 2000 -screen 1024x768 -retro

The one bit of magic here is the -schedMax 2000 flag, which causes Xephyr to update the screen less often when applications are very busy and serves to reduce the overhead of screen updates while running x11perf. Future Work Having managed to accelerate 17 of the 392 operations in x11perf, it's pretty clear that I could spend a bunch of time just stepping through each of the remaining ones and working on them. Before doing that, we want to try and work out some general principles about how to handle core X fill styles. Moving all of the stipple and tile computation to the GPU will help reduce the amount of code necessary to fill rectangles and spans, along with improving performance, assuming the above exercise generalizes to other primitives. Getting and Testing the Code Most of the changes here are from Eric's glamor-server branch:

git://people.freedesktop.org/~anholt/xserver glamor-server

The two patches shown above, along with a pair of simple clean-up patches that I've written this week, are available here:

git://people.freedesktop.org/~keithp/xserver glamor-server

Of course, as this now uses libepoxy, you'll need to fetch, build and install that before trying to compile this X server. Because you can try all of this out in Xephyr, it's easy to download and build this X server and then run it right on top of your current driver stack inside of X. I'd really like to hear from people with Radeon or nVidia hardware to know whether the code works, and how it compares with fb on the same machine, which you get when you elide the -glamor argument from the example Xephyr command line above.

17 February 2014

Keith Packard: MicroPeak Approved for NAR Contests

MicroPeak Approved for NAR Contests The NAR Contest Board has approved MicroPeak for use in contests requiring a barometric altimeter starting on the 1st of April, 2014. You can read the announcement message on the contestRoc Yahoo message board here: Contest Board Approves New Altimeter. The message was sent out on the 30th of January, but there is a 60 day waiting period after the announcement has been made before you can use MicroPeak in a contest, so the first date approved for contest flights is April 1. After that date, you should see MicroPeak appear in Appendix G of the pink book, which lists the altimeters approved for contest use. Thanks much to the NAR contest board and all of the fliers who helped get MicroPeak ready for this!

15 February 2014

Keith Packard: Altos1.3.2

AltOS 1.3.2 Bug fixes and improved APRS support Bdale and I are pleased to announce the release of AltOS version 1.3.2. AltOS is the core of the software for all of the Altus Metrum products. It consists of firmware for our cc1111, STM32L151, LPC11U14 and ATtiny85 based electronics and Java-based ground station software. This is a minor release of AltOS, including bug fixes for TeleMega, TeleMetrum v2.0 and AltosUI. AltOS Firmware GPS Satellite reporting and APRS improved Firmware version 1.3.1 has a bug on TeleMega when it has data from more than 12 GPS satellites. This causes buffer overruns within the firmware. 1.3.2 limits the number of reported satellites to 12. APRS now continues to send the last known good GPS position, and reports GPS lock status and number of sats in view in the APRS comment field, along with the battery and igniter voltages. AltosUI TeleMega GPS Satellite, GPS max height and Fire Igniters AltosUI was crashing when TeleMega reported that it had data from more than 12 satellites. While the TeleMega firmware has been fixed to never do that, AltosUI also has a fix in case you fly a TeleMega board without updated firmware. GPS max height is now displayed in the flight statistics. As the u-Blox GPS chips now provide accurate altitude information, we've added the maximum height as computed by GPS here. Fire Igniters now uses the letters A through D to label the extra TeleMega pyro channels instead of the numbers 0-3.

1 February 2014

Keith Packard: No Fosdem

Missing FOSDEM I'm afraid Eric and I won't be at FOSDEM this weekend; our flight got canceled, and the backup they offered would have gotten us there late Saturday night. It seemed crazy to fly to Brussels for one day of FOSDEM, so we decided to just stay home and get some work done. Sorry to be missing all of the fabulous FOSDEM adventures and getting to see all of the fun people who attend one of the best conferences around. Hope everyone has a great time, and finds only the best chocolates.

23 January 2014

Keith Packard: Altos1.3.1

AltOS 1.3.1 Bug fixes and improved APRS support Bdale and I are pleased to announce the release of AltOS version 1.3.1. AltOS is the core of the software for all of the Altus Metrum products. It consists of firmware for our cc1111, STM32L151, LPC11U14 and ATtiny85 based electronics and Java-based ground station software. This is a minor release of AltOS, including bug fixes for TeleMega, TeleMetrum v2.0 and AltosUI. AltOS Firmware Antenna down fixed and APRS improved Firmware version 1.3 has a bug in the support for operating the flight computer with the antenna facing downwards; the accelerometer calibration data would be incorrect. Furthermore, the accelerometer self-test routine would be confused if the flight computer were moved in the first second after power on. The firmware now simply re-tries the self-test several times. I went out and bought a real APRS radio, the Yaesu FT1D, to replace my venerable VX 7R. With this in hand, I changed our APRS support to use the compressed position format, which takes fewer bytes to send, offers increased resolution and includes altitude data. I took the altitude data out of the comment field and replaced that with battery and igniter voltages. This makes APRS reasonably useful in pad mode to monitor the state of the flight computer before boost. Anyone with a TeleMega should update to the new firmware eventually, although there aren't any critical bug fixes here, unless you're trying to operate the device with the antenna pointing downwards. AltosUI TeleMega support and offline map loading improved. I added all of the new TeleMega sensor data as possible elements in the graph. This lets you see roll rates and horizontal acceleration values for the whole flight. The Fire Igniter dialog now lists all of the TeleMega extra pyro channels so you can play with those on the ground as well. Our offline satellite images are downloaded from Google, but they restrict us to reading 50 images per minute.
When we tried to download a 9x9 grid of images to save for later use on the flight line, Google would stop feeding us maps after the first 50. You'd have to poke the button a second time to try and fill in the missing images. We fixed this by just limiting how fast we load maps, and now we can reliably load an 11x11 grid of images. Of course, there are also a few minor bug fixes, so it's probably worth updating even if the above issues don't affect you.
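As a rough illustration of the map-loading fix described above, here is a minimal sketch of the arithmetic behind pacing requests under a per-minute limit. It is not the AltosUI code (which is Java); the function names are hypothetical, and evenly spaced requests are an assumption about how the pacing might be done.

```c
#include <assert.h>

/* Smallest spacing, in milliseconds, between consecutive requests so that
 * evenly spaced requests stay at roughly `per_minute` per minute.
 * Rounds up so integer truncation never shortens the gap. */
static long min_interval_ms(int per_minute)
{
    assert(per_minute > 0);
    return (60000L + per_minute - 1) / per_minute;
}

/* Time needed to issue `tiles` requests at that spacing; the first
 * request goes out immediately, so only tiles - 1 gaps are counted. */
static long total_fetch_time_ms(int tiles, int per_minute)
{
    if (tiles <= 1)
        return 0;
    return (long)(tiles - 1) * min_interval_ms(per_minute);
}
```

Under these assumptions a 50-per-minute limit means 1200 ms between requests, so an 11x11 grid (121 images) would take a little under two and a half minutes to fetch.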

18 January 2014

James Bromberger: Linux.conf.au 2014: LCA TV

The radio silence here on my blog has been not from lack of activity, but the inverse. Linux.conf.au chewed up the few remaining spare cycles I have had recently (after family and work), but not from organising the conference (been there, got the T-Shirt and the bag). So, let's do a run through of what has happened. LCA2014 Perth has come and gone in pretty smooth fashion. A remarkable effort from the likes of the Perth crew of Luke, Paul, Euan, Leon, Jason, Michael, and a slew of volunteers who stepped up, not to mention our interstate friends Steve and Erin, Matthew, James I, Tim the Streaming guy and others, and our pro organisers at Manhattan Events. It was a reasonably smooth ride: the UWA campus was beautiful, the lecture theatres were workable, and the Octagon Theatre was at its best when filled with just shy of 500 like-minded people and an accomplished person gracing the stage. What was impressive (to me, at least) was the effort of the AV team (which I was on the extreme edges of); videos of keynotes hit the Linux Australia mirror within hours of the event. Recording and live streaming of all keynotes and sessions happened almost flawlessly. Leon had built a reasonably robust video capture management system (eventstreamer on github) to ensure that people fresh to DVswitch had nothing break so bad it didn't automatically fix itself, and all of this was monitored from the Operations Room (called the TAVNOC, which would have been the AV NOC, but somehow a loose reference to the UWA Tavern, the Tav, crept in there). Some 167 videos were made and uploaded; most of this was also mirrored on campus before the end of the conference so attendees could load up their laptops with plenty of content for the return trip home. Euan's quick Blender work meant there was a nice intro and outro graphic, and Leon's scripting ensured that Zookeepr, the LCA conference management software, was the source of truth in getting all videos processed and tagged correctly.
I was scheduled to give (and did give) a presentation at LCA 2014 about Debian on Amazon Web Services (on Thursday), and attended as many of the sessions as possible, but my good friend Michael Davies (LCA 2004 chair, and chair of the LCA Papers Committee for a good many years) had another role for this year. We wanted to capture some of the hallway track of Linux.conf.au that is missed in all the videos of presentations. And thus was born LCA TV. LCA TV consisted of the video equipment for an additional stream (mixer host, cameras, cables and switches), hooking into the same streaming framework as the rest of the sessions. We took over a corner of the registration room (UWA Undercroft), brought in a few stage lights, a couch, coffee table, seat, some extra mics, and aimed to fill the session gaps with informal chats with some of the people at Linux.conf.au: speakers, attendees, volunteers alike. And come they did. One or two interviews didn't succeed (this was an experiment), but in the end, we've got over 20 interviews with some interesting people. These streamed out live to the people watching LCA from afar, those unable to make it to Perth in early January; but they were recorded too and we can start to watch them (see below). I was also lucky enough to mix the video for the three keynotes as well as the opening and closing, with a very capable crew around the Octagon Theatre. As the curtain came down, and the 2014 crew took to the stage to be congratulated by the attendees, I couldn't help but feel a little bit proud and a touch nostalgic: memories from 11 years earlier when LCA 2003 came to a close in the very same venue. So, before we head into the viewing season for LCA TV, let me thank all the volunteers who organised, the AV volunteers, the Registration volunteers, the UWA team who helped with the Octagon, Networking, and the awesome CB Radios hooked up to the UWA repeater that worked all the way to the airport. Thanks to the Speakers who submitted proposals.
The Speakers who were accepted, made the journey and took to the stage. The people who attended. The sponsors who help make this happen. All of the above helps share the knowledge, and ultimately, move the community forward. But my thanks to Luke and Paul for agreeing to stand there in the middle of all this madness and hive of semi-structured activity that just worked. Please remember this was experimental; the noise was the buzz of the conference going on around us. There was pretty much only one person on the AV kit: my thanks to Andrew Cooks, whom I'll dub our sound editor, vision director, floor manager, and anything else. So who did we interview?

Alan Robertson (Assim Proj)
Arjen Lentz (twice; well, two topics!)
Daniel (A student at LCA for the first time)
Dave Chinner (XFS)
Erin Walsh (Rego desk manager)
Jason Nicholls (AV Director LCA 2014)
Jeremy Kerr (Kernel Developer)
Jessica Smith (Astronomy Mini Conf)
Jono Oxer (Audosat)
Karen Sandler (Gnome)
Keith Packard (X) and Bdale Garbee (Freedom Box, Debian)
Kevin Vinsen (ICRAR, Square Kilometer Array)
Lennart Poettering (SystemD)
Linus Torvalds (Yet another Kernel Developer)
Matthew Wilcox (Another Kernel dev, and a Debian Dev as well)
Michael Still (OpenStack)
Paul Weyper (Canberra Linux Users Group)
Paul Wise (Debian)
Pia Waugh (Open Government)
Rusty Russell (Yet another Kernel Developer! Oh, and started LCA in 1999)

One or two talks did not work, so apologies to those that are missing. Here's the playlist to start you off! Enjoy.

14 January 2014

Keith Packard: gl-and-bitmaps

X bitmaps vs OpenGL Of course, you all know that X started life as a monochrome window system for the VS100. Back then, bitmaps and rasterops were cool; you could do all kinds of things with simple bit operations. Things changed, and eventually X bitmaps became useful only off-screen for clip masks, text and stipples. These days, you'll rarely see anyone using a bitmap; everything we used to use bitmaps for has gone all alpha-values on us. In OpenGL, there aren't any bitmaps. About the most bitmap-like object you'll find is an A8 texture, holding 8 bits of alpha value for each pixel. There's no way to draw to or texture from anything where each pixel is represented as a single bit. So, as Eric goes about improving Glamor, he got a bit stuck with bitmaps. We could either:

Support them only on the CPU, uploading copies as A8 textures when used as a source in conjunction with GPU objects.
Support them as 1bpp on the CPU and A8 on the GPU, doing fancy tracking between the two objects when rendering occurred.
Fix the CPU code to deal with bitmaps stored 8 bits per pixel.

I thought the last choice would be the best plan: directly share the same object between CPU and GPU rendering, avoiding all reformatting as things move around in the server. Why is this non-trivial? Usually, you can flip formats around with reckless abandon in X; it has separate bits-per-pixel and depth values everywhere. That's how we do things like 32 bits-per-pixel RGB surfaces; we just report them as depth 24 and everyone is happy. Bitmaps are special though. The X protocol has separate (and overly complicated) image formats for single bit images, and those have to be packed 1 bit per pixel. Within the server, bitmaps are used for drawing core text, stippling and creating clip masks. They're the lingua franca of image formats, allowing you to translate between depths by pulling a single plane out of a source image and painting it into a destination of arbitrary depth. As such, the frame buffer rendering code in the server assumes that bitmaps are always 1 bit per pixel. Given that it must deal with 1bpp images on the wire, and given the history of X, it certainly made sense at the time to simplify the code with this assumption. A blast from the past I'd actually looked into doing this before. As part of the DECstation line, DEC built the MX monochrome frame buffer board, and to save money, they actually created it by populating a single bit in each byte rather than packing 8 pixels into each byte. I have this vague memory that they were able to use only 4 memory chips this way. The original X driver for this exposed a depth-8 static color format because of the assumptions made by the (then current) CFB code about bitmap formats. Jim Gettys wandered over to MIT while the MX frame buffer was in design and asked how hard it would be to support it as a monochrome device instead of the depth-8 static color format.
At the time, fixing CFB would have been a huge effort, and there wasn't any infrastructure for separating the wire image format from the internal pixmap format. So, we gave up and left things looking weird to applications. Hacking FB These days, the amount of frame buffer code in the X server is dramatically less; CFB and MFB have been replaced with the smaller (and more general) FB code. It turns out that the places which need to deal with individual bits in a bitmap are now limited to a few stippling and CopyPlane functions. And, in those functions, the individual read operations from the bitmap are few in number. Each of those fetches looked like:

bits = READ(src++)

All I needed to do was make this operation return 32 packed bits by pulling one bit from each byte of 8 separate 32-bit chunks (four bits per chunk) and merging them together. The first thing to do was to pad the pixmap out to a 32 byte boundary, rather than a 32 bit boundary. This ensured that I would always be able to fetch data from the bitmap in 8 32-bit chunks. Next, I simply replaced the READ macro call with:

    bits = fb_stip_read(src, srcBpp);
    src += srcBpp;

The new fb_stip_read function checks srcBpp and packs things together for 8bpp images:

/*
 * Given a depth 1, 8bpp stipple, pull out
 * a full FbStip worth of packed bits
 */
static inline FbStip
fb_pack_stip_8_1(FbStip *bits)
{
    FbStip      r = 0;
    int         i;
    for (i = 0; i < 8; i++) {
        FbStip  b;
        /* FbStip rather than uint8_t: the MSBFirst masks need the high
         * bits, and the LSBFirst shift goes up to 28 */
        FbStip  p;
        b = FB_READ(bits++);
#if BITMAP_BIT_ORDER == LSBFirst
        /* low bit of each of the four bytes lands in bits 0..3 of p */
        p = (b & 1) | ((b >> 7) & 2) | ((b >> 14) & 4) | ((b >> 21) & 8);
        r |= p << (i << 2);
#else
        /* one bit from each of the four bytes lands in bits 31..28 of p */
        p = (b & 0x80000000) | ((b << 7) & 0x40000000) |
            ((b << 14) & 0x20000000) | ((b << 21) & 0x10000000);
        r |= p >> (i << 2);
#endif
    }
    return r;
}
/*
 * Return packed stipple bits from src
 */
static inline FbStip
fb_stip_read(FbStip *bits, int bpp)
{
    switch (bpp) {
    default:
        return FB_READ(bits);
    case 8:
        return fb_pack_stip_8_1(bits);
    }
}

It turns into a fairly hefty amount of code, but the number of places this ends up being used is pretty small, so it shouldn't increase the size of the server by much. Of course, I've only tested the LSBFirst case, but I think the MSBFirst code is correct. I've sent the patches to do this to the xorg-devel mailing list, and they're also on the depth1 branch in my repository:

git://people.freedesktop.org/~keithp/xserver.git
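To make the packing step easier to check in isolation, here is a small standalone sketch of the LSBFirst case. It is not the server code: the helper names are made up for illustration, plain uint32_t stands in for FbStip, and the words are assembled from bytes explicitly so the result does not depend on host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble a 32-bit word from four bytes, lowest address in the low byte. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Reference version: one pixel per byte (0 or 1) in, pixel i lands in bit i. */
static uint32_t pack_bytewise(const uint8_t pixels[32])
{
    uint32_t r = 0;
    assert(pixels != NULL);
    for (int i = 0; i < 32; i++)
        r |= (uint32_t)(pixels[i] & 1) << i;
    return r;
}

/* Word-at-a-time version using the same shifts and masks as the LSBFirst
 * branch in the post: four pixels come out of each 32-bit word. */
static uint32_t pack_wordwise(const uint8_t pixels[32])
{
    uint32_t r = 0;
    assert(pixels != NULL);
    for (int i = 0; i < 8; i++) {
        uint32_t b = load_le32(pixels + 4 * i);
        uint32_t p = (b & 1) | ((b >> 7) & 2) | ((b >> 14) & 4) | ((b >> 21) & 8);
        r |= p << (i << 2);
    }
    return r;
}
```

Both functions should agree on any input whose bytes are 0 or 1; the word-wise form simply does a quarter as many reads, which is the point of reading the bitmap in 8 32-bit chunks.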

Testing Eric also hacked up the test suite to be runnable by piglit, and I've run it in that mode against these changes. I had made a few mistakes, and the test suite caught them nicely. Let's hope this adventure helps Eric out as he continues to improve Glamor.

2 January 2014

Keith Packard: Present-pixmap-lifetimes-part-deux

Pixmap ID Lifetimes under Present Redirection (Part Deux) I recently wrote about pixmap content and ID lifetimes. I think the pixmap content lifetime stuff was useful, but the pixmap ID stuff was not quite right. I've just copied the section from the previous post and will update it. PresentRedirect pixmap ID lifetimes (reprise) A compositing manager will want to know the lifetime of the pixmaps delivered in PresentRedirectNotify events to clean up whatever local data it has associated with them. For instance, GL compositing managers will construct textures for each pixmap that need to be destroyed when the pixmap disappears. Present encourages pixmap destruction The PresentPixmap request says:

PresentRegion holds a reference to pixmap until the presentation occurs, so pixmap may be immediately freed after the request executes, even if that is before the presentation occurs.

Breaking this when doing Present redirection seems like a bad idea; it's a very common pattern for existing 2D applications. New pixmap IDs for everyone Because pixmap IDs for present contents are owned by the source application (and may be re-used immediately when freed), Present can't use that ID in the PresentRedirectNotify event. Instead, it must allocate a new ID and send that instead. The server has its own XID space to use, and so it can allocate one of those and bind it to the same underlying pixmap; the pixmap will not actually be destroyed until both the application XID and the server XID are freed. The compositing manager will receive this new XID, and is responsible for freeing it when it doesn't need it any longer. Of course, if the compositing manager exits, all of the XIDs passed to it will automatically be freed. I considered allocating client-specific XIDs for this purpose; the X server already has a mechanism for allocating internal IDs for various things that are to be destroyed at client shut-down time. Those XIDs have bit 30 set, and are technically invalid XIDs as X specifies that the top three bits of every XID will be zero. However, the cost of using a server ID instead (which is a valid XID) is small, and it's always nice to not intentionally break the X protocol (even as we continue to take advantage of accidental breakages). (Reserving the top three bits of XIDs and insisting that they always be zero was a nod to the common practice of weakly typed languages like Lisp. In these languages, 32-bit object references were typed by using a few tag bits (2-3) within the value. Most of these were just pointers to data, but small integers that fit in the remaining bits (29-30) could be constructed without allocating any memory. By making XIDs only 29 bits, these languages could be assured that all XIDs would be small integers.)
Pixmap Destroy Notify The compositing manager needs to know when the application is done with the pixmap so that it can clean up when it is also done, destroying the extra pixmap ID it was given and freeing any other local resources. When the application sends a FreePixmap request, that XID will become invalid, but of course the pixmap itself won't be freed until the compositing manager sends a FreePixmap request with the extra XID it was given. Because the pixmap ID known by the compositing manager is going to be different from the original application XID, we need an event delivered to the compositing manager with the new XID, which makes this event rather Present specific. We don't need to select for this event; the compositing manager must handle it correctly, so we'll just send it whenever the compositing manager has received that custom XID. PresentPixmapDestroyNotify
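The bit layout described in the parenthetical above can be sketched in a few lines of C. The macro and function names here are invented for illustration, and the 3-bit tag with value 1 is an assumed Lisp-style encoding, not any particular implementation's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XID_RESERVED_MASK 0xE0000000u /* top three bits: zero in every valid XID */
#define INTERNAL_ID_BIT   0x40000000u /* bit 30, set on the internal IDs mentioned above */
#define FIXNUM_TAG_BITS   3           /* hypothetical tag width */
#define FIXNUM_TAG        0x1u        /* hypothetical tag value meaning "small integer" */

/* A protocol-valid XID fits in 29 bits. */
static bool xid_is_valid(uint32_t xid)
{
    return (xid & XID_RESERVED_MASK) == 0;
}

/* Internal IDs have bit 30 set, which also makes them protocol-invalid. */
static bool xid_is_internal(uint32_t xid)
{
    return (xid & INTERNAL_ID_BIT) != 0;
}

/* Store a valid XID as a tagged small integer in a single 32-bit word. */
static uint32_t fixnum_from_xid(uint32_t xid)
{
    assert(xid_is_valid(xid));
    return (xid << FIXNUM_TAG_BITS) | FIXNUM_TAG;
}

/* Recover the XID by dropping the tag bits. */
static uint32_t xid_from_fixnum(uint32_t value)
{
    return value >> FIXNUM_TAG_BITS;
}
```

Because a valid XID never uses its top three bits, shifting it left by three loses nothing, which is why a 29-bit XID always survives the round trip through a tagged word.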

 
    PresentPixmapDestroyNotify
    type: CARD8         XGE event type (35)
    extension: CARD8        Present extension opcode
    sequence-number: CARD16
    length: CARD32          0
    evtype: CARD16          Present_PixmapDestroyNotify
    eventID: XFIXESEVENTID
    event-window: WINDOW
    pixmap: PIXMAP

This event is delivered to the clients selecting for SubredirectNotify for pixmaps which were delivered in a PresentRedirectNotify event and for which the originating application Pixmap has been destroyed. Note that the pixmap may still be referenced by other objects within the X server, by a pending PresentPixmap request, as a window background, GC tile or stipple or Picture drawable (among others), but this event serves to notify the selecting client that the application is no longer able to access the underlying pixmap with its original Pixmap ID.
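The lifetime rule in this post (two XIDs bound to one pixmap, storage released only when both are freed, and a destroy notification when the application's ID goes away first) can be sketched as a small reference count. The type and function names are hypothetical; this is a model of the described behaviour, not X server code.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int  id_refs;      /* number of XIDs still bound to this pixmap */
    bool app_id_live;  /* application's XID not yet freed */
    bool storage_live; /* false once the pixmap storage is released */
} SketchPixmap;

/* A freshly created pixmap has only the application's XID. */
static void sketch_pixmap_init(SketchPixmap *p)
{
    p->id_refs = 1;
    p->app_id_live = true;
    p->storage_live = true;
}

/* Redirection binds an extra server-allocated XID for the compositing manager. */
static void sketch_bind_extra_id(SketchPixmap *p)
{
    assert(p->storage_live);
    p->id_refs++;
}

/* Free one XID. Returns true when a destroy notification is due, that is,
 * when the application's XID was freed while another XID still holds
 * the pixmap. */
static bool sketch_free_id(SketchPixmap *p, bool is_app_id)
{
    assert(p->id_refs > 0);
    p->id_refs--;
    if (is_app_id)
        p->app_id_live = false;
    if (p->id_refs == 0)
        p->storage_live = false;
    return is_app_id && p->id_refs > 0;
}
```

In this model the application freeing its ID first yields a notification while the storage stays alive; the compositing manager freeing the extra ID afterwards releases the storage.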

28 December 2013

Keith Packard: Present-redirect-lifetimes

Object Lifetimes under Present Redirection Present extension redirection passes responsibility for showing application contents from the X server to the compositing manager. This eliminates an extra copy to the composite extension window buffer and also allows the application contents to be shown in the right frame. (Currently, the copy from application buffer to window buffer is synchronized to the application-specified frame, and then a Damage event is delivered to the compositing manager which constructs the screen image using the window buffer and presents that to the screen at least one frame later, which generally adds another frame of delay for the application.) The redirection operation itself is simple: just wrap the PresentPixmap request up into an event and send it to the compositing manager. However, the pixmap to be presented is allocated by the application, and hence may disappear at any time. We need to carefully control the lifetime of the pixmap ID and the specific frame contents of the pixmap so that the compositing manager can reliably construct complete frames. We'll separately discuss the lifetime of the specific frame contents from that of the pixmap itself. By "the frame contents", I mean the image, as a set of pixel values in the pixmap, to be presented for a specific frame. Present Pixmap contents lifetime After the application is finished constructing a particular frame in a pixmap, it passes the pixmap to the X server with PresentPixmap. With non-redirected Present, the X server is responsible for generating a PresentIdleNotify event once the server is finished using the contents. There are three different cases that the server handles, matching the three different PresentCompleteModes:

Copy. The pixmap contents are not needed after the copy operation has been performed. Hence, the PresentIdleNotify event is generated when the target vblank time has been reached, right after the X server calls CopyArea.
Flip. The pixmap is being used for scanout, and so the X server won't be done using it until some other scanout buffer is selected. This can happen as a result of window reconfiguration which makes the displayed window no longer full-screen, but the usual case is when the application presents a subsequent frame for display, and the new frame replaces the old. Thus, the PresentIdleNotify event generally occurs when the target vblank time for the subsequent frame has been reached, right after the subsequent frame's pixmap has been selected for scanout.
Skip. The pixmap contents will never be used, and the X server figures this out when a subsequent frame is delivered with a matching target vblank time. This happens when the subsequent Present operation is queued by the X server.

In the Redirect case, the X server cannot tell when the compositing manager is finished with the pixmap. The same three cases as above apply here, but the results are slightly different:

Composite. The pixmap is being used as a part of the screen image and must be composited with other window pixmaps. In this case, the compositing manager will need to hold onto the pixmap until a subsequent pixmap is provided by the application. Thus, the pixmap will remain needed by the compositing manager until it receives a subsequent PresentRedirectNotify for the same window.
Flip. The compositing manager is free to take the application pixmap and use it directly in a subsequent PresentPixmap operation and let the X server flip to it; this provides a simple way of avoiding an extra copy while not needing to fuss around with unredirecting windows. In this case, the X server will need the pixmap contents until a new scanout pixmap is provided, and the compositing manager will also need the pixmap in case the contents are needed to help construct a subsequent frame.
Skip. In this case, the compositing manager notices that the window's pixmap has been replaced before it was ever used.

In case 2, the X server and the compositing manager will need to agree on when the PresentIdleNotify event should be delivered. In the other two cases, the compositing manager itself will be in charge of that. To let the compositing manager control when the event is delivered, the X server will count the number of PresentPixmap redirection events sent, and the compositing manager will deliver matching PresentIdle requests. PresentIdle

 
    PresentIdle
    pixmap: PIXMAP
 
Errors: Pixmap, Match

Informs the server that the Pixmap passed in a PresentRedirectNotify event is no longer needed by the client. Each PresentRedirectNotify event must be matched by a PresentIdle request for the originating client to receive a PresentIdleNotify event. PresentRedirect pixmap ID lifetimes A compositing manager will want to know the lifetime of the pixmaps delivered in PresentRedirectNotify events to clean up whatever local data it has associated with them. For instance, GL compositing managers will construct textures for each pixmap that need to be destroyed when the pixmap disappears. Some kind of PixmapDestroyNotify event is necessary for this; the alternative is for the compositing manager to constantly query the X server to see if the pixmap IDs it is using are still valid, and even that isn't reliable as the application may re-use pixmap IDs for a new pixmap. It seems like this PixmapDestroyNotify event belongs in the XFixes extension; it's a general mechanism that doesn't directly relate to Present. XFixes doesn't currently have any Generic Events associated, but adding that should be fairly straightforward. The Present extension would then automatically select for PixmapDestroyNotify events when it delivers the pixmap in a PresentRedirectNotify event, so that the client is assured of receiving the associated PixmapDestroyNotify event. One remaining question I have is whether this is sufficient, or if the compositing manager needs a stable pixmap ID which lives beyond the life of the application pixmap ID. If so, the solution would be to have the X server allocate an internal ID for the pixmap and pass that to the client somehow, presumably in addition to the original pixmap ID. XFixesPixmapSelectInput

XFIXESEVENTID { XID }
    Defines a unique event delivery target for Present
    events. Multiple event IDs can be allocated to provide
    multiple distinct event delivery contexts.
PIXMAPEVENTS { XFixesPixmapDestroyNotify }
 
    XFixesPixmapSelectInput
    eventid: XFIXESEVENTID
    pixmap: PIXMAP
    events: SETofPIXMAPEVENTS
 
Errors: Pixmap, Value

Changes the set of events to be delivered for the target pixmap. A Value error is sent if events contains invalid event selections.

XFixesPixmapDestroyNotify

 
    XFixesPixmapDestroyNotify
    type: CARD8         XGE event type (35)
    extension: CARD8        XFixes extension request number
    sequence-number: CARD16
    length: CARD32          0
    evtype: CARD16          XFixes_PixmapDestroyNotify
    eventID: XFIXESEVENTID
    pixmap: PIXMAP

This event is delivered to all clients selecting for it on pixmap when the pixmap ID is destroyed by a client. Note that the pixmap may still be referenced by other objects within the X server, as a window background, GC tile or stipple or Picture drawable (among others), but this event serves to notify the selecting client that the ID is no longer associated with the underlying pixmap.
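The lifetime problem described above can be illustrated with a toy model of a compositing manager's per-pixmap cache. This is only a sketch: `TextureCache`, its method names and the integer "texture handles" are hypothetical, and no real Xlib, XCB or GL calls are involved. It shows why dropping the cached entry on PixmapDestroyNotify matters once an application re-uses a pixmap ID.

```python
class TextureCache:
    """Toy model of a compositing manager's pixmap-ID -> texture map.

    Everything here is hypothetical; integers stand in for GL textures.
    """

    def __init__(self):
        self.textures = {}      # pixmap ID -> texture handle
        self.next_texture = 1   # next fake texture handle to hand out

    def on_redirect_notify(self, pixmap_id):
        """Return the texture for a pixmap, creating one on first sight."""
        if pixmap_id not in self.textures:
            self.textures[pixmap_id] = self.next_texture
            self.next_texture += 1
        return self.textures[pixmap_id]

    def on_pixmap_destroy_notify(self, pixmap_id):
        """Drop the texture; the ID no longer names the old pixmap."""
        self.textures.pop(pixmap_id, None)


cache = TextureCache()
first = cache.on_redirect_notify(0x400001)    # new pixmap, new texture
again = cache.on_redirect_notify(0x400001)    # same pixmap, same texture
cache.on_pixmap_destroy_notify(0x400001)      # application freed the ID
reused = cache.on_redirect_notify(0x400001)   # ID re-used for a new pixmap
```

Without the destroy notification, the last call would hand back the stale texture built for the old pixmap.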

19 December 2013

Keith Packard: Altos1.3

AltOS 1.3: TeleMega and EasyMini support

Bdale and I are pleased to announce the release of AltOS version 1.3. AltOS is the core of the software for all of the Altus Metrum products. It consists of firmware for our cc1111, STM32L151, LPC11U14 and ATtiny85 based electronics and Java-based ground station software. This is a major release of AltOS as it includes support for both of our brand new flight computers, TeleMega and EasyMini.

AltOS Firmware: New hardware, new features and fixes

Our new advanced flight computer, TeleMega, required a lot of new firmware features, including:

9 DoF IMU (3 axis accelerometer, 3 axis gyroscope, 3 axis compass).
Orientation tracking using the gyroscopes (and quaternions, which are lots of fun!)
APRS support so your existing amateur radio receiver can track the location of your rocket.
Software FEC, both encoding and decoding.
Four fully-programmable pyro channels, in addition to the usual apogee and main channels.
STM32L CPU support. TeleMega needed a more powerful processor. The STM32L is a 32-bit ARM Cortex-M3 which is definitely up to the challenge.

Our new easy-to-use flight computer, EasyMini, also uses a new processor, the LPC11U14, which is an ARM Cortex-M0 part.

For our existing cc1111 devices, there are some minor bug fixes for the flight software, so you should plan on re-flashing flight units at some point. However, there aren't any incompatible changes, so you don't have to do it all at once.

Bug fixes:

More USB fixes for Windows.
Turn off the cc1111 RC oscillator at startup. This may save a bit of power, and may reduce noise inside the chip a bit.

AltosUI: Redesigned for TeleMega and EasyMini support

AltosUI has also seen quite a bit of work for the 1.3 release, but almost all of that was a massive internal restructuring necessary to support flight computers with a wide range of sensors. From the user's perspective, it's pretty similar with a few changes:

Graphs can now show the raw barometric pressure
Support for TeleMega and EasyMini, including alternate TeleMega pyro channel configuration.
Bug fixes in how data were extracted from a flight record for graphing; sometimes values would end up getting plotted out of order, causing weird jaggy lines.

13 December 2013

Keith Packard: xserver-warnings

Cleaning up X server warnings

So I was sitting in the Narita airport with a couple of other free software developers merging X server patches. One of the developers was looking over my shoulder while the X server was building and casually commented on the number of warnings generated by the compiler. I felt like I had invited someone into my house without cleaning for months: embarrassed and ashamed that we'd let the code devolve into this state.

Of course, we've got excuses: the X server code base is one of the oldest pieces of regularly used free software in existence. It was started before ANSI-C was codified. No function prototypes, no const, no void *, no enums or stdint.h. There may be a few developers out there who remember those days (fondly, of course), but I think most of us are glad that our favorite systems language has gained a lot of compile-time checking in the last 25 years.

We've spent time in the past adding function prototypes and cleaning up other warnings, but there was never a point at which the X server compiled without any warnings. More recently, we've added a pile of new warning flags when compiling the X server which only served to increase the number of warnings dramatically.

The current situation

With the master branch of the X server and released versions of the dependencies, we generate 1047 warnings in the default build.

-Wcast-qual considered chatty

The GCC flag, -Wcast-qual, complains when you cast a pointer and change the const qualifier status. A very common thing for the X server to do is declare pointers as const to mark them as immutable once assigned. Often, the relevant data is actually constructed once at startup in allocated memory and stored to the data structure. During server reset, that needs to be freed, but free doesn't take a const pointer, so we cast to (void *), which -Wcast-qual then complains about. Loudly. Of the 1047 warnings, 380 of them are generated by this one warning flag. I've gone ahead and just disabled it in util/macros for now.

String constants are a pain

The X server uses string constants to initialize defaults for font paths, configuration options, font names along with a host of other things. These end up getting stored in variables that can also take allocated storage. I've gone ahead and declared the relevant objects as const and then fixed the code to suit. I don't have a count of the number of warnings these changes fixed; they were scattered across dozens of X server directories, and I was fixing one directory at a time, but probably more than half of the remaining warnings were of this form.

And a host of other warnings

Fixing the rest of the warnings was mostly a matter of stepping through them one at a time and actually adjusting the code. Shadowed declarations, unused values, redundant declarations and missing printf attributes were probably the bulk of them though.

Changes to external modules

Instead of just hacking the X server code, I've created patches for other modules where necessary to fix the problems in the right place.

proto/fontsproto. Declares FontPathElement names as const char *
mesa/drm. Adds printf attribute to the debug_print function
util/macros. Removes -Wcast-qual from the default warning set.

Getting the bits

In case it wasn't clear, the X server build now generates zero warnings on my machine. I'm hoping that this will also be true for other people. Patches are available at:

xserver - git://people.freedesktop.org/~keithp/xserver warning-fixes
fontsproto - git://people.freedesktop.org/~keithp/fontsproto fontsproto-next
mesa/drm - git://people.freedesktop.org/~keithp/drm warning-fixes
util/macros - already upstream on master

Keeping our house clean

Of course, these patches are all waiting until 1.15 ships so that we don't accidentally break something important. However, once they're merged, I'll be bouncing any patches which generate warnings on my system, and if other people find warnings when they build, I'll expect them to send patches as well.

Now to go collect the tea cups in my office and get them washed along with the breakfast dishes so I won't be embarrassed if some free software developers show up for lunch today.

29 November 2013

Keith Packard: Debian TC

Appointed to the Debian Tech Committee

I'm pleased to announce that I've been appointed to serve on the Debian Technical Committee. I'd like to thank the other committee members and the Debian project leader, Lucas Nussbaum, for giving me this opportunity to serve. I look forward to working within the committee to further Debian's goals as the universal operating system.

Keith Packard: Black Friday 2013

Back on Black (Friday) Event

Altus Metrum is pleased to announce our Back on Black (Friday) event! For the first time since the Black Forest fire in June, we're re-opening our web store this weekend with a host of new and classic Altus Metrum products, including a special pre-order discount on our latest-and-greatest flight computer design, TeleMega. This weekend only, Friday, 29 November 2013 through Monday, 2 December 2013, the first 40 TeleMega direct orders placed through our web store will receive a special $50 pre-order discount (regular $400, now only $350!).

TeleMega is an advanced flight computer with 9-axis IMU, 6 pyro channels, uBlox Max 7Q GPS and 40mW telemetry system. We designed TeleMega to be the ideal flight computer for sustainers and other complex projects. TeleMega production is currently in process, and we expect to be ready to ship in mid-December. Pre-order now and we won't charge you until we ship. Learn more about TeleMega at: http://altusmetrum.org/TeleMega/

We are also pleased to announce that TeleBT is back in stock. Priced at $150, TeleBT is our latest ground station that connects to your laptop over USB or your Android device over BlueTooth. Learn more about TeleBT at http://altusmetrum.org/TeleBT/

Another new product we're thrilled to announce is EasyMini! Priced at only $80, EasyMini is a two-channel flight computer with built-in data logging and USB data download.

Like our more advanced flight computers, EasyMini is loaded with sophisticated electronics and firmware, designed to be very simple to use yet capable enough for high performance airframes. Perfect as a first flight computer, EasyMini is also great as a backup deployment controller in complex projects. Learn more about EasyMini at: http://altusmetrum.org/EasyMini/

Also in stock for immediate shipment is MicroPeak, our 1.9 gram recording altimeter available for $50. The MicroPeak USB adapter, also $50, has been improved to make data downloading a snap. Read more about these at: http://altusmetrum.org/MicroPeak and http://altusmetrum.org/MicroPeakUSB

You can learn more about these and all our other Altus Metrum products at http://altusmetrum.org. The special discount on TeleMega pre-orders is available only on orders placed directly through Bdale's web store at http://shop.gag.com

Thank you all for your support of Altus Metrum during 2013. It's been a rough year, but we're having a great time updating our existing products and designing new stuff! We look forward to returning products like TeleMetrum and TeleMini to the market soon, and plan to introduce even more new products soon.

19 November 2013

Raphaël Hertzog: Will Debian's technical committee coopt Keith Packard or Philipp Kern?

The process has been ongoing for more than a year but the Debian technical committee is about to select a candidate to recommend for its vacant seat. The Debian Project Leader will then (likely) appoint him (looks like it won't be a woman). According to recent discussions on debian-ctte@lists.debian.org, it seems that either Keith Packard or Philipp Kern will join the committee. If you look at the current membership of the committee, you will see:

Bdale Garbee: USA
Russ Allbery: USA
Don Armstrong: USA
Andreas Barth: Germany
Ian Jackson: United Kingdom
Steve Langasek: USA
Colin Watson: United Kingdom

That's very Anglo-Saxon centric (6 out of 7 members). While I trust the current members and while I know that they are open-minded people, it still bothers me to see this important body with so little diversity. Coming back to the choice at hand, Keith Packard is American and Philipp Kern is German. No new country in the mix. I can only hope that Philipp will be picked to bring some more balance to the body.


28 October 2013

Keith Packard: Quaternions

Tracking Orientation with Quaternions

I spent the flight back from China and the weekend adding orientation tracking to AltOS. I'd done a bit of research over the last year or so working out the technique, but there's always a big step between reading about something and actually doing it. I know there are a pile of quaternion articles on the net, but I wanted to write down precisely what I did, mostly as a reminder to myself in the future when I need to go fix the code.

Quaternion Basics

Quaternions were invented by Sir William Rowan Hamilton around 1843. It seems to have started off as a purely theoretical piece of math, extending complex numbers from two dimensions to four by introducing two more roots of -1 and defining them to follow:

i² = j² = k² = ijk = -1

Use these new roots to create numbers with four real components, three of which are multiplied by our three roots:

r + ix + jy + kz

With a bit of algebra, you can figure out how to add and multiply these composite values, using the above definition to reduce and combine terms so that you end up with a set which is closed under the usual operations. Then we add a few more definitions, like the conjugate:

q = (r + ix + jy + kz)
q* = (r - ix - jy - kz)

The norm:

‖q‖ = √(qq*) = √(r² + x² + y² + z²)

u is a unit quaternion if its norm is one:

‖u‖ = 1
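To make the definitions above concrete, here is a minimal sketch that stores a quaternion as an (r, x, y, z) tuple. The function names `qmul`, `qconj` and `qnorm` are illustrative choices for this sketch, not names taken from the AltOS sources.

```python
import math

def qmul(a, b):
    """Hamilton product of two quaternions given as (r, x, y, z) tuples."""
    ar, ax, ay, az = a
    br, bx, by, bz = b
    return (ar*br - ax*bx - ay*by - az*bz,
            ar*bx + ax*br + ay*bz - az*by,
            ar*by - ax*bz + ay*br + az*bx,
            ar*bz + ax*by - ay*bx + az*br)

def qconj(q):
    """Conjugate: negate the three imaginary components."""
    r, x, y, z = q
    return (r, -x, -y, -z)

def qnorm(q):
    """Norm: square root of the sum of the squared components."""
    return math.sqrt(sum(c * c for c in q))

# The three roots of -1 as quaternions.
I = (0, 1, 0, 0)
J = (0, 0, 1, 0)
K = (0, 0, 0, 1)

minus_one = (-1, 0, 0, 0)
ijk = qmul(qmul(I, J), K)   # should equal -1, per the defining identity
```

Multiplying a quaternion by its conjugate leaves only the real part, r² + x² + y² + z², which is why the norm can be written as √(qq*).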

Quaternions and Rotation

Ok, so we've got a cute little 4-dimensional algebra. How does this help with our rotation problem? Let's figure out how to rotate a point in space by an arbitrary rotation, defined by an axis of rotation and an amount in radians. First, take a vector, v, and construct a quaternion, q, as follows:

q = 0 + ivx + jvy + kvz

Now, take a unit quaternion u, which represents a vector in the above form along the axis of rotation, and a rotation amount, θ, and construct a quaternion r as follows:

r = cos θ/2 + u sin θ/2

With a pile of algebra, you can show that the rotation of q by r is:

q′ = r q r*

In addition, if you have two rotations, s and r , then the composite rotation, t , a rotation by r followed by s can be computed with:

q′ = s (r q r*) s*
   = (sr) q (r*s*)
   = (sr) q (sr)*
t  = s r
q′ = t q t*
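The rotation and composition rules above can be checked numerically with a short sketch. The helper names (`axis_angle`, `rotate`) and the two 90° test rotations are assumptions made for illustration only.

```python
import math

def qmul(a, b):
    """Hamilton product of two quaternions given as (r, x, y, z) tuples."""
    ar, ax, ay, az = a
    br, bx, by, bz = b
    return (ar*br - ax*bx - ay*by - az*bz,
            ar*bx + ax*br + ay*bz - az*by,
            ar*by - ax*bz + ay*br + az*bx,
            ar*bz + ax*by - ay*bx + az*br)

def qconj(q):
    r, x, y, z = q
    return (r, -x, -y, -z)

def axis_angle(axis, theta):
    """r = cos(theta/2) + u sin(theta/2), for a unit axis u = (x, y, z)."""
    s = math.sin(theta / 2)
    return (math.cos(theta / 2), axis[0] * s, axis[1] * s, axis[2] * s)

def rotate(r, v):
    """Rotate the 3-vector v by the unit quaternion r: q' = r q r*."""
    q = (0.0, v[0], v[1], v[2])
    _, x, y, z = qmul(qmul(r, q), qconj(r))
    return (x, y, z)

r = axis_angle((0, 0, 1), math.pi / 2)   # 90 degrees about Z
s = axis_angle((1, 0, 0), math.pi / 2)   # 90 degrees about X
t = qmul(s, r)                           # r first, then s

step_by_step = rotate(s, rotate(r, (1.0, 0.0, 0.0)))
composite = rotate(t, (1.0, 0.0, 0.0))
```

Both `step_by_step` and `composite` take the X axis to Y and then to Z, so the single quaternion t = s r stands in for the pair of rotations.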

That's a whole lot simpler than carrying around a 3x3 matrix to do the rotation, which makes sense as a matrix representation of a rotation has a bunch of redundant information, and it avoids a pile of problems if you try to represent the motion as three separate axial rotations performed in sequence.

Computing an initial rotation

Ok, so the rocket is sitting on the pad, and it's tilted slightly. I need to compute the initial rotation quaternion based on the accelerometer readings, which provide a vector, g, pointing up. Essentially, I want to compute the rotation that would take g and make it point straight up. Construct a vector v, which does point straight up:

g = (0, ax, ay, az) / norm(0, ax, ay, az)
v = (0, 0, 0, 1)

g is normalized so that it is also a unit vector. The cross product between g and v will be a vector normal to both, which is the axis of rotation. As both g and v are unit vectors, the length of their cross product will be sin θ:

a = g × v
  = u sin θ

The cosine of the angle between g and v is the dot product of the two vectors, divided by the product of their lengths. As both g and v are unit vectors, the product of their lengths is one, so we have

cos θ = g · v

For our quaternion, we need cos θ/2 and sin θ/2, which we can get from the half-angle formulae:

cos θ/2 = √((1 + cos θ)/2)
sin θ/2 = √((1 - cos θ)/2)

Now we construct our quaternion by factoring sin θ out of a:

q = cos θ/2 + (u sin θ) sin θ/2 / sin θ
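The steps above can be sketched as follows. This is an illustration rather than the AltOS implementation: `initial_rotation` and the sample accelerometer values are assumptions, and the degenerate cases (already vertical, or exactly upside down) are handled only crudely.

```python
import math

def qmul(a, b):
    """Hamilton product of two quaternions given as (r, x, y, z) tuples."""
    ar, ax, ay, az = a
    br, bx, by, bz = b
    return (ar*br - ax*bx - ay*by - az*bz,
            ar*bx + ax*br + ay*bz - az*by,
            ar*by - ax*bz + ay*br + az*bx,
            ar*bz + ax*by - ay*bx + az*br)

def rotate(r, v):
    """Rotate 3-vector v by unit quaternion r: q' = r q r*."""
    rc = (r[0], -r[1], -r[2], -r[3])
    _, x, y, z = qmul(qmul(r, (0.0, v[0], v[1], v[2])), rc)
    return (x, y, z)

def initial_rotation(ax, ay, az):
    """Quaternion rotating the measured 'up' vector onto (0, 0, 1)."""
    n = math.sqrt(ax*ax + ay*ay + az*az)
    g = (ax / n, ay / n, az / n)
    # a = g x v with v = (0, 0, 1); its length is sin(theta)
    a = (g[1], -g[0], 0.0)
    sin_t = math.sqrt(a[0]*a[0] + a[1]*a[1])
    cos_t = g[2]                      # g . v
    if sin_t < 1e-9:
        # Already vertical; the axis is undefined here, and the
        # upside-down case is NOT handled by this sketch.
        return (1.0, 0.0, 0.0, 0.0)
    cos_half = math.sqrt(max(0.0, (1 + cos_t) / 2))
    sin_half = math.sqrt(max(0.0, (1 - cos_t) / 2))
    k = sin_half / sin_t              # turns u sin(theta) into u sin(theta/2)
    return (cos_half, a[0] * k, a[1] * k, a[2] * k)

# Hypothetical pad reading: tilted a little off vertical.
q = initial_rotation(0.1, -0.2, 0.97)
n = math.sqrt(0.1**2 + 0.2**2 + 0.97**2)
up = rotate(q, (0.1 / n, -0.2 / n, 0.97 / n))
```

Rotating the normalized reading by `q` should land on (0, 0, 1) to within rounding.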

Updating the rotation based on gyro readings

The gyro sensor reports the rate of rotation along all three axes. To compute the change in rotation, we take the instantaneous sensor value and multiply it by the time since the last reading and divide by two (because we want half angles for our quaternions). With the three half angles, (x,y,z), we can compute a composite rotation quaternion:

   cos x cos y cos z + sin x sin y sin z +
i (sin x cos y cos z - cos x sin y sin z) +
j (cos x sin y cos z + sin x cos y sin z) +
k (cos x cos y sin z - sin x sin y cos z)
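As a sketch of this step, the composite quaternion can be built directly from the gyro rates. The names `gyro_delta` and `update`, and the choice to multiply the increment on the right of the previous rotation, are assumptions for illustration; the post does not spell out the multiplication order.

```python
import math

def qmul(a, b):
    """Hamilton product of two quaternions given as (r, x, y, z) tuples."""
    ar, ax, ay, az = a
    br, bx, by, bz = b
    return (ar*br - ax*bx - ay*by - az*bz,
            ar*bx + ax*br + ay*bz - az*by,
            ar*by - ax*bz + ay*br + az*bx,
            ar*bz + ax*by - ay*bx + az*br)

def gyro_delta(rates, dt):
    """Composite rotation for gyro rates (rad/s) held for dt seconds."""
    x, y, z = (w * dt / 2 for w in rates)     # half angles
    cx, cy, cz = math.cos(x), math.cos(y), math.cos(z)
    sx, sy, sz = math.sin(x), math.sin(y), math.sin(z)
    return (cx*cy*cz + sx*sy*sz,
            sx*cy*cz - cx*sy*sz,
            cx*sy*cz + sx*cy*sz,
            cx*cy*sz - sx*sy*cz)

def update(rotation, rates, dt):
    """Fold one gyro sample into the running rotation.

    Assumption: the increment is measured in the flight frame, so it is
    applied on the right of a flight-to-ground rotation.
    """
    return qmul(rotation, gyro_delta(rates, dt))

delta = gyro_delta((0.2, 0.4, 0.6), 1.0)      # half angles 0.1, 0.2, 0.3
current = update((1.0, 0.0, 0.0, 0.0), (0.2, 0.4, 0.6), 1.0)
```

The four-term formula is the product of three single-axis half-angle quaternions (Z times Y times X), so the result stays a unit quaternion.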

Now we combine this with the previous rotation to construct our current rotation.

Doing this faster

If we read our sensor fast enough that the angles were a small fraction of a radian, then we could take advantage of this approximation:

sin x ≈ x
cos x ≈ 1

that simplifies the above computation considerably:

1 + xyz + i (x - yz) + j (y + xz) + k (z - xy)

And, as x, y, z ≪ 1, we can further simplify by dropping the quadratic and cubic elements as insignificant:

1 + ix + jy + kz
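A quick numerical comparison shows how the linearized form behaves. The two rotation rates used here (90°/s and 2000°/s) are hypothetical values chosen for illustration; only the 100Hz sampling rate comes from the text.

```python
import math

def exact_delta(x, y, z):
    """Full composite rotation quaternion for half angles x, y, z."""
    cx, cy, cz = math.cos(x), math.cos(y), math.cos(z)
    sx, sy, sz = math.sin(x), math.sin(y), math.sin(z)
    return (cx*cy*cz + sx*sy*sz,
            sx*cy*cz - cx*sy*sz,
            cx*sy*cz + sx*cy*sz,
            cx*cy*sz - sx*sy*cz)

def small_angle_delta(x, y, z):
    """Linearized form: 1 + ix + jy + kz."""
    return (1.0, x, y, z)

def max_error(rate_deg_s, hz=100.0):
    """Largest component error for one sample of single-axis rotation."""
    half = math.radians(rate_deg_s) / hz / 2
    exact = exact_delta(half, 0.0, 0.0)
    approx = small_angle_delta(half, 0.0, 0.0)
    return max(abs(e - a) for e, a in zip(exact, approx))

slow = max_error(90.0)     # modest rotation rate (hypothetical)
fast = max_error(2000.0)   # quick motion (hypothetical)
```

The per-sample error grows roughly with the square of the half angle, so the fast case is hundreds of times worse than the slow one, and the error accumulates across samples.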

This works at our 100Hz sampling rate when the rotation rates are modest, but quick motions will introduce a bunch of error. Given that we've got plenty of CPU for this task, there's no reason to use this simpler model. If we did crank up the sensor rate a bunch, we might reconsider.

Computing the Current Orientation

We have a rotation quaternion which maps the flight frame back to the ground frame. To compute the angle from vertical, we simply take a vector in flight frame along the path of flight (0, 0, 0, 1) and rotate that back to the ground frame:

g = r (0 0 0 1) r*

That will be a unit vector in ground frame pointing along the axis of the rocket. The arc-cosine of the Z element will be the angle from vertical.

Results

All of the above code is checked into the AltOS git repository. I added a test mode to the firmware that just dumps out the current orientation over the USB link, which lets you play with rotating the board to see how well the system tracks the current orientation. There's a bit of gyro drift, as you'd expect, but overall, the system tracks the current orientation within less than a tenth of a degree per second. Even with all of this computation added, the whole flight software is consuming less than 7% of the STM32L CPU time.
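The angle-from-vertical computation described above (rotate the flight-frame axis back to the ground frame, then take the arc-cosine of its Z element) can be sketched like this. The function name `tilt_degrees` and the 30° test rotation are assumptions for illustration, not taken from the AltOS sources.

```python
import math

def qmul(a, b):
    """Hamilton product of two quaternions given as (r, x, y, z) tuples."""
    ar, ax, ay, az = a
    br, bx, by, bz = b
    return (ar*br - ax*bx - ay*by - az*bz,
            ar*bx + ax*br + ay*bz - az*by,
            ar*by - ax*bz + ay*br + az*bx,
            ar*bz + ax*by - ay*bx + az*br)

def qconj(q):
    r, x, y, z = q
    return (r, -x, -y, -z)

def tilt_degrees(r):
    """Angle between the rocket axis and vertical, in degrees."""
    # g = r (0 0 0 1) r*
    _, _, _, z = qmul(qmul(r, (0.0, 0.0, 0.0, 1.0)), qconj(r))
    # Clamp against rounding before taking the arc-cosine.
    return math.degrees(math.acos(max(-1.0, min(1.0, z))))

# Hypothetical orientation: tipped 30 degrees about the X axis.
half = math.radians(30.0) / 2
tipped = (math.cos(half), math.sin(half), 0.0, 0.0)
angle = tilt_degrees(tipped)
upright = tilt_degrees((1.0, 0.0, 0.0, 0.0))
```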

5 September 2013

Keith Packard: Airfest-altimeter-testing

Altimeter Testing at Airfest

Bdale and I, along with AJ Towns and Mike Beattie, spent last weekend in Argonia, Kansas, flying rockets with our Kloudbusters friends at Airfest 19. We had a great time! AJ and Mike both arrived a week early at Bdale's to build L3 project airframes, and both flew successful cert flights at Airfest!

Airfest was an opportunity for us to test fly prototypes of new flight electronics Bdale and I have spent the last few weeks developing, and I thought I'd take a few minutes today to write some notes about what we built and flew.

TeleMega

We've been working on TeleMega for quite a while. It's a huge step up in complexity from our original TeleMetrum, as it has a raft of external sensors and six pyro circuits. Bdale flew TeleMega in his new fiberglass 4" airframe on a Loki 75mm blue M demo motor. GPS tracking was excellent; you can see here that GPS altitude tracked the barometric sensor timing exactly:

GPS lost lock when the motor lit, but about 3 seconds after motor burnout, it re-acquired the satellite signals and was reporting usable altitude data right away. The GPS reported altitude was higher than the baro sensor, but that can be explained by our approximation of an atmospheric model used to convert pressure into altitude. The rest of the flight was also nominal; TeleMega deployed drogue and main chutes just fine.

TeleMetrum

We've redesigned TeleMetrum. The new version uses better sensors (MS5607 baro sensor, MMA6555 accelerometer) and a higher power radio (CC1120 40mW). The board is the same size, all the connectors are in the same places so it's a drop-in replacement, and it's still got two pyro channels and USB for configuration, data download and battery charging. I loaded up my Candy-Cane airframe with a small 5 grain 38mm CTI classic:

The flight computer worked perfectly, but GPS reception was not as good as we'd like to see:

Given how well TeleMega was receiving GPS signals, I'm hopeful that we'll be able to tweak TeleMetrum to improve performance.

TeleMini

We've also redesigned TeleMini. It's still a two-channel flight computer with logging and telemetry, but we've replaced the baro sensor with the MS5607, added on-board flash for increased logging space and added on-board screw terminals for an external battery and power switch. You can still use one of our 3.7V batteries, but you can also use another battery providing from 3.7 to 15V. I was hoping to finish up the firmware and fly it, but I ran out of time before the launch. The good news is that all of the components of the board have been tested and work correctly, and the firmware is "feature complete", meaning we've gotten all of the features coded; it's just not quite working yet.

EasyMini

EasyMini is a new product for us. It's essentially the same as a TeleMini, but without a radio. Two channels, baro-only, with logging. Like TeleMini, it includes an on-board USB connector and can use either one of our 3.7V batteries, or an external battery from 3.7V to 15V. EasyMini and TeleMini are the same size, and have holes in the same places, so you can swap between them easily. I flew EasyMini in my Koala airframe with a 29mm 3 grain CTI blue-streak motor. EasyMini successfully deployed the main chute and logged flight data:

We also sent a couple of boards home with Kevin Trojanowski and Greg Rothman for them to play with.

TeleGPS

TeleGPS is a GPS tracker, incorporating a u-blox Max receiver and a 70cm transmitter. It can send position information via APRS or our usual digital telemetry formats. I was also hoping to have the TeleGPS firmware working, and I spent a couple of nights in the motel coding, but didn't manage to finish up. So, no data from this board either.

Production Plans

Given the success of the latest TeleMega prototype, we're hoping to have it into production first. We'll do some more RF testing on the bench with the boards to make sure it meets our standards before sending it out for the first production run. The goal is to have TeleMega ready to sell by the end of October.

TeleMetrum clearly needs work on the layout to improve GPS RF performance. With the testing equipment that Bdale is in the midst of re-acquiring, it should be possible to finish this up fairly soon. However, the flight firmware looks great, so we're hoping to get these done in time to sell by the end of November.

TeleMini is looking great from a hardware perspective, but the firmware needs work. Once the firmware is running, we'll need to make enough test flights to shake out any remaining issues before moving forward with it.

EasyMini is also looking finished; I've got a stack of prototypes and will be getting people to fly them at my local launch in another couple of weeks. The plan here is to build a small batch by hand and get them into the store once we're finished testing, using those to gauge interest before we pay for a larger production run.
